ABSTRACT

Contextual bandit algorithms have become popular for on- 
line recommendation systems such as Digg, Yahoo! Buzz, 
and news recommendation in general. Offline evaluation of 
the effectiveness of new algorithms in these applications is 
critical for protecting online user experiences but very chal- 
lenging due to their "partial-label" nature. Common practice 
is to create a simulator which simulates the online environ- 
ment for the problem at hand and then run an algorithm 
against this simulator. However, creating the simulator itself
is often difficult and modeling bias is usually unavoidable.
In this paper, we introduce a replay methodology
for contextual bandit algorithm evaluation. Different
from simulator-based approaches, our method is completely 
data-driven and very easy to adapt to different applications. 
More importantly, our method can provide provably unbi- 
ased evaluations. Our empirical results on a large-scale news 
article recommendation dataset collected from Yahoo! Front 
Page conform well with our theoretical results. Furthermore, 
comparisons between our offline replay and online bucket 
evaluation of several contextual bandit algorithms show the
accuracy and effectiveness of our offline evaluation method.

Categories and Subject Descriptors 

H.3.5 [Information Systems]: On-line Information Services;
I.2.6 [Computing Methodologies]: Learning

General Terms 

Algorithms, Experimentation 

Keywords 

Recommendation, multi-armed bandit, contextual bandit, 
offline evaluation, benchmark dataset 

1. INTRODUCTION

Web-based content recommendation services such as Digg, 
Yahoo! Buzz and Yahoo! Today Module (Figure 1) leverage
user activities such as clicks to identify the most attractive
contents. One inherent challenge is how to score newly gen- 
erated contents such as breaking news, especially when the 
news first emerges and little data are available. A person- 
alized service which can tailor contents towards individual 
users is more desirable and challenging. 

A distinct feature of these applications is their "partial- 
label" nature: we observe user feedback (click or not) for an 
article only when this article is displayed. A key challenge 
thus arises which is known as the exploration/exploitation 
tradeoff: on one hand, we want to exploit (i.e., choose articles
of higher quality estimates to promote our business of
interest), but on the other hand, we have to explore (i.e.,
choose articles with lower quality estimates to collect user
feedback so as to improve our article selection strategy in
the long run). The balance between exploration and exploitation
may be modeled as a "contextual bandit" [18], a subclass of
reinforcement learning problems [26], and is also present in
many other important Web-based applications such as online ads
display and search query suggestion.

An ideal way to evaluate a contextual-bandit algorithm 
is to conduct a bucket test, in which we run the algorithm 
to serve a fraction of live user traffic in the real recommen- 
dation system. However, not only is this method expen- 
sive, requiring substantial engineering efforts in deploying 
the method in the real system, but it may also have negative 
impacts on user experience. Furthermore, it is not easy to 
guarantee replicable comparison using bucket tests as online 
metrics vary significantly over time. Offline evaluation of
contextual-bandit algorithms thus becomes valuable when
we try to optimize an online recommendation system. 

Although benchmark datasets for supervised learning such 
as the UCI repository [9] have proved valuable for empiri- 
cal comparison of algorithms, collecting benchmark data to- 
wards reliable offline evaluation has been difficult in bandit 
problems. In our application of news article recommenda- 
tion on Yahoo! Front Page, for example, each user visit 
results in the following information stored in the log: user 
information, the displayed news article, and user feedback 
(click or not). When using data of this form to evaluate a 
bandit algorithm offline, we will not have user feedback if the 
algorithm recommends a different news article than the one 
stored in the log. In other words, data in bandit-style ap- 
plications only contain user feedback for recommendations 
that were actually displayed to the user, but not all candi- 
dates. This "partial-label" nature raises a difficulty that is 
the key difference between evaluation of bandit algorithms 
and supervised learning ones. 



Common practice for evaluating bandit algorithms is to 
create a simulator and then run the algorithm against it. 
With this approach, we can evaluate any bandit algorithm 
without having to run it in a real system. Unfortunately, 
there are two major drawbacks with this approach. First, 
creating a simulator can be challenging and time-consuming 
for practical problems. Second, evaluation results based on 
artificial simulators may not reflect the actual performance 
since simulators are only rough approximations of real problems
and unavoidably contain modeling bias.

Our contributions are two-fold. First, we describe and 
study an offline evaluation method for bandit algorithms, 
which enjoys valuable theoretical guarantees including un- 
biasedness and accuracy. Second, we verify the method's 
effectiveness by comparing its evaluation results to online 
bucket results using a large volume of data recorded from 
Yahoo! Front Page. These positive results not only encourage
wide use of the proposed method in other Web-based
applications, but also suggest a promising solution to create 
benchmark datasets from real-world applications for bandit 
algorithms. 

Related Work. 

Unbiased evaluation has been studied before under different
settings. While our unbiased evaluation method is
briefly sketched in an earlier paper [19] and may be interpreted
as a special case of the exploration scavenging technique
[17], we conduct a thorough investigation in this work,
including improved theoretical guarantees and positive em- 
pirical evidence using online bucket data. 

2. CONTEXTUAL BANDIT PROBLEMS 

The multi-armed bandit problem is a classic and popu- 
lar model for studying the exploration-exploitation tradeoff. 
Despite the simplicity of the model, it has found wide appli- 
cations in important problems like medical treatment allo- 
cation, and recently, in challenging, large-scale problems like 
Web content optimization [2, 19]. Different from the classic
multi-armed bandit problems, we are particularly concerned 
with a more interesting setting where for each round contex- 
tual information is available for decision making. 

2.1 Notation 

For the purpose of this paper, we consider the multi-armed
bandit problem with contextual information. Following previous
work [18], we call it a contextual bandit problem.^1 Formally,
we define by $\mathcal{A} = \{1, 2, \ldots, K\}$ the set of arms, and
a contextual-bandit algorithm $A$ interacts with the world in
discrete trials $t = 1, 2, 3, \ldots$. In trial $t$:

1. The world chooses a feature vector $x_t$ known as the
context. Associated with each arm $a$ is a real-valued
payoff $r_{t,a} \in [0, 1]$ that can be related to the context
$x_t$ in an arbitrary way. We denote by $\mathcal{X}$ the (possibly
infinite) set of contexts, and $(r_{t,1}, \ldots, r_{t,K})$ the payoff
vector. Furthermore, we assume $(x_t, r_{t,1}, \ldots, r_{t,K})$ is
drawn i.i.d. from some unknown distribution $D$.

2. Based on observed payoffs in previous trials and the
current context $x_t$, $A$ chooses an arm $a_t \in \mathcal{A}$, and
receives payoff $r_{t,a_t}$. It is important to emphasize here
that no feedback information (namely, the payoff $r_{t,a}$)
is observed for unchosen arms $a \neq a_t$.

3. The algorithm then improves its arm-selection strategy
with all information it observes, $(x_t, a_t, r_{t,a_t})$.

^1 In the literature, contextual bandits are sometimes called
bandits with covariate [21], associative reinforcement learning
[14], bandits with expert advice [6], bandits with side
information [27], and associative bandits [25].
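The trial protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_trials`, `toy_world`, and `match_context` are hypothetical, arms are indexed from 0 rather than 1, and the toy world (two arms, a binary context, payoff 1 for the arm equal to the context) is invented for the example.

```python
import random

def run_trials(algorithm, draw_event, num_trials, rng):
    """Run the contextual-bandit protocol and return (total payoff, history)."""
    history = []          # observed (context, arm, payoff) triples
    total_payoff = 0.0
    for _ in range(num_trials):
        # Step 1: the world draws a context and a full payoff vector.
        context, payoffs = draw_event(rng)
        # Step 2: the algorithm picks an arm from the history and context;
        # only the payoff of the chosen arm is revealed.
        arm = algorithm(history, context)
        payoff = payoffs[arm]
        # Step 3: the observation is added to what the algorithm can use.
        history.append((context, arm, payoff))
        total_payoff += payoff
    return total_payoff, history

def toy_world(rng):
    """Hypothetical world: 2 arms, payoff 1 only for the arm equal to the context."""
    context = rng.randint(0, 1)
    payoffs = [1.0 if arm == context else 0.0 for arm in range(2)]
    return context, payoffs

def match_context(history, context):
    """Hypothetical fixed policy that ignores the history."""
    return context

total, history = run_trials(match_context, toy_world, 10, random.Random(0))
```

Note that `run_trials` never exposes the payoffs of unchosen arms to the algorithm, which is the "partial-label" property discussed in the introduction.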

In this process, the total T-trial payoff of $A$ is defined as

$$G_A(T) \stackrel{\text{def}}{=} \mathbb{E}_D\left[\sum_{t=1}^{T} r_{t,a_t}\right],$$

where the expectation $\mathbb{E}_D[\cdot]$ is defined w.r.t. the i.i.d.
generation process of $(x_t, r_{t,1}, \ldots, r_{t,K})$ according to
distribution $D$ (and the algorithm $A$ as well if it is not deterministic).
Similarly, given a policy $\pi$ that maps contexts to actions,
$\pi : \mathcal{X} \to \mathcal{A}$, we define its total T-trial payoff by

$$G_\pi(T) \stackrel{\text{def}}{=} \mathbb{E}_D\left[\sum_{t=1}^{T} r_{t,\pi(x_t)}\right] = T \cdot \mathbb{E}_D\left[r_{1,\pi(x_1)}\right],$$

where the second equality is due to our i.i.d. assumption.
Given a reference set $\Pi$ of policies, we define the optimal
expected T-trial payoff with respect to $\Pi$ as

$$G^*(T) \stackrel{\text{def}}{=} \max_{\pi \in \Pi} G_\pi(T).$$

For convenience, we also define the per-trial payoff of an
algorithm or policy, which is defined, respectively, by

$$g_A \stackrel{\text{def}}{=} \frac{G_A(T)}{T}, \qquad g_\pi \stackrel{\text{def}}{=} \frac{G_\pi(T)}{T}.$$

Much research in multi-armed bandit problems is devoted to
developing algorithms with large total payoff. Formally, we
may search for an algorithm minimizing regret with respect
to the optimal arm-selection strategy in $\Pi$. Here, the T-trial
regret $R_A(T)$ of algorithm $A$ with respect to $D$ is defined by

$$R_A(T) \stackrel{\text{def}}{=} G^*(T) - G_A(T). \qquad (1)$$

An important special case of the general contextual bandit 
problem is the well-known K-armed bandit in which the context
$x_t$ remains constant for all $t$. Since both the arm set
and contexts are constant at every trial, they have no effect 
on a bandit algorithm, and so we will also refer to this type 
of bandit as a context-free bandit. 

In the example of news article recommendation, we may 
view articles in the pool as arms, and for the t-th user visit 
(trial t), one article (arm) is chosen to serve the user. When 
the served article is clicked on, a payoff of 1 is incurred; 
otherwise, the payoff is 0. With this definition of payoff, 
the expected payoff of an article is precisely its click-through 
rate (CTR), and choosing an article with maximum CTR 
is equivalent to maximizing the expected number of clicks 
from users, which in turn is the same as maximizing the 
total expected payoff in our bandit formulation. 

2.2 Existing Bandit Algorithms 

The fundamental challenge in bandit problems is the need
for balancing exploration and exploitation. To minimize the
regret in Equation (1), an algorithm $A$ exploits its past
experience to select the arm that appears best. On the other
hand, this seemingly optimal arm may in fact be suboptimal,
due to imprecision in $A$'s knowledge. In order to avoid
this undesired situation, $A$ has to explore the world by
actually choosing seemingly suboptimal arms so as to gather
more information about them (cf. step 3 in the bandit process
defined in the previous subsection). Exploration can increase
short-term regret since some suboptimal arms may be
chosen. However, obtaining information about the arms'
average payoffs (i.e., exploration) can refine $A$'s estimate of the
arms' payoffs and in turn reduce long-term regret. Clearly,
neither a purely exploring nor a purely exploiting algorithm
works best in general, and a good tradeoff is needed.
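One common way to mix the two behaviors is the standard epsilon-greedy rule, sketched below for the context-free case. This sketch is an illustration only and is not taken from this section: the function name `epsilon_greedy`, the 0-based arm indices, and the list of payoff estimates are assumptions made for the example.

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """Pick an arm given current per-arm payoff estimates.

    With probability epsilon the arm is drawn uniformly at random
    (exploration); otherwise the arm with the highest estimate is
    chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda arm: estimates[arm])

rng = random.Random(0)
greedy_choice = epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng)    # pure exploitation
random_choice = epsilon_greedy([0.1, 0.9, 0.3], 1.0, rng)    # pure exploration
```

Setting `epsilon` to 0 or 1 recovers the purely exploiting and purely exploring extremes mentioned above; intermediate values trade one off against the other.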

There are roughly two classes of bandit algorithms. The
first class of algorithms attempts to minimize the regret as
the number of steps increases. Formally, such algorithms
$A$ ensure the quantity $R_A(T)/T$ vanishes over time as $T$
grows. While low-regret algorithms have been extensively
studied for the context-free K-armed bandit problem [7],
the more general contextual bandit problem has remained
challenging. Another class of algorithms is based on Bayes
rule, such as Gittins index methods [13]. Such Bayesian
approaches may have competitive performance with appropriate
prior distributions, but are often computationally
prohibitive without coupling with approximation.

The Appendix describes a few representative low-regret 
algorithms used in our experiments, but it should be noted 
that our method is algorithm independent, and so may be 
applied to evaluate Bayesian algorithms as well. 

3. UNBIASED OFFLINE EVALUATION 

Compared to machine learning in the more standard su- 
pervised learning setting, evaluation of methods in a contex- 
tual bandit setting is frustratingly difficult. Our goal here 
is to measure the performance of a bandit algorithm A, that 
is, a rule for selecting an arm at each time step based on 
the preceding interactions and current context (such as the 
algorithms described above). 

More formally, we want to estimate the per-trial payoff

$$g_A = \frac{G_A(T)}{T} = \frac{1}{T}\,\mathbb{E}_D\left[\sum_{t=1}^{T} r_{t,a_t}\right].$$

Here, $a_t$ is the t-th action chosen by $A$, and in general
depends on the previous contexts, actions, and observed
rewards. Because of the interactive nature of the problem, it
would seem that the only way to do this evaluation unbias- 
edly is to actually run the algorithm online on "live" data. 
However, in practice, this approach is likely to be infeasible 
due to the serious logistical challenges such as extensive en- 
gineering resources and potential risks on user experiences. 
Rather, we may only have offline data available that was 
collected at a previous time using an entirely different log- 
ging policy. Because payoffs are only observed for the arms 
chosen by the logging policy, which are likely to differ from 
those chosen by the algorithm A being evaluated, it is not 
at all clear how to evaluate A based only on such logged 
data. This evaluation problem may be viewed as a special 
case of the so-called "off-policy policy evaluation problem" 
in the reinforcement-learning literature [22]. In the multi- 
armed bandit setting, however, there is no need for "tempo- 
ral credit assignment", and thus more efficient solutions are 
possible. 

One solution is to build a simulator to model the bandit
process from the logged data, and then evaluate $A$ with the
simulator. Although this approach is straightforward, the
modeling step is often very expensive and difficult, and more
importantly, it often introduces modeling bias to the simulator,
making it hard to justify the reliability of the obtained
evaluation results. In contrast, we propose an approach that
is unbiased, grounded on data, and simple to implement.

In this section, we describe a sound technique for carrying 
out such an evaluation, assuming that the individual events 
are i.i.d., and that the logging policy chose each arm at each 
time step uniformly at random. Although we omit the de- 
tails, this latter assumption can be weakened considerably so 
that any randomized logging policy is allowed and the algo- 
rithm can be modified accordingly using rejection sampling, 
but at the cost of decreased data efficiency. 

More precisely, we suppose that there is some unknown
distribution $D$ from which tuples are drawn i.i.d. of the form
$(x, r_1, \ldots, r_K)$, each consisting of observed context and
unobserved payoffs for all arms. We also posit access to a long
sequence of logged events resulting from the interaction of
the uniformly random logging policy with the world. Each
such event consists of the context vector $x$, a selected arm $a$,
and the resulting observed payoff $r_a$. Crucially, this logged
data is partially labeled in the sense that the payoff is
observed only for the single arm $a$ that was chosen uniformly
at random.

Our goal is to use this data to evaluate a bandit algorithm
$A$, which is a (possibly randomized) mapping for selecting
the arm $a_t$ at time $t$ based on the history $h_{t-1}$ of $t - 1$
preceding events together with the current context. Therefore, the
data serves as a benchmark, with which people can evaluate
and compare different bandit algorithms. As in supervised
learning, having such benchmark sets will allow easier, replicable
comparisons of algorithms in real-life data.

It should be noted that this section focuses on contextual 
bandit problems with constant arm sets of size K. While 
this assumption leads to easier exposition and analysis, it 
may not be satisfied in practice. For example, in the news 
article recommendation problem studied in Section 4, the set
of arms is not fixed: new arms may become available while
old arms may be dismissed. Consequently, the events are
independent but drawn from non-identical distributions. We
do not investigate this setting formally, although it is possible
to generalize our setting in Section 2 to this variable arm set
case. Empirically, we find the evaluator is very stable.

3.1 An Unbiased Offline Evaluator 

In this subsection, for simplicity of exposition, we take this 
sequence of logged events to be an infinitely long stream. 
But we also give explicit bounds on the actual finite number 
of events required by our evaluation method. A variation 
for finite data streams is studied in the next subsection. 

The policy evaluator is shown in Algorithm 1 [19]. The
method takes as input a bandit algorithm $A$ and a desired
number of "valid" events $T$ on which to base the evaluation.
We then step through the stream of logged events one by
one. If, given the current history $h_{t-1}$, it happens that the
policy $A$ chooses the same arm $a$ as the one that was
selected by the logging policy, then the event is retained (that
is, added to the history), and the total payoff $G_A$ updated.
Otherwise, if the policy $A$ selects a different arm from the
one that was taken by the logging policy, then the event
is entirely ignored, and the algorithm proceeds to the next
event without any change in its state.

Note that, because the logging policy chooses each arm
uniformly at random, each event is retained by this algorithm
with probability exactly $1/K$, independent of everything
else. This means that the events which are retained
have the same distribution as if they were selected by $D$.
As a result, we can prove that two processes are equivalent:
the first is evaluating the policy against $T$ real-world events
from $D$, and the second is evaluating the policy using the
policy evaluator on a stream of logged events. Theorem 1
formalizes this intuition.

Algorithm 1 Policy_Evaluator (with infinite data stream).
Inputs: T > 0; bandit algorithm A; stream of events S
h_0 ← ∅ {An initially empty history}
G_A ← 0 {An initially zero total payoff}
for t = 1, 2, 3, ..., T do
    repeat
        Get next event (x, a, r_a) from S
    until A(h_{t-1}, x) = a
    h_t ← CONCATENATE(h_{t-1}, (x, a, r_a))
    G_A ← G_A + r_a
end for
Output: G_A / T
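A minimal Python sketch of Algorithm 1 follows. It assumes, for illustration only, that a bandit algorithm is a callable `algorithm(history, context)` returning an arm, that the stream is any iterable of `(context, arm, payoff)` triples logged by a uniformly random policy, and that arms are compared with `==`; the name `policy_evaluator` and the tiny example stream are hypothetical.

```python
def policy_evaluator(algorithm, stream, num_valid):
    """Replay logged events until num_valid of them match the algorithm.

    Returns the estimated per-trial payoff G_A / T. Raises StopIteration
    if the stream is exhausted before num_valid matches are found.
    """
    events = iter(stream)
    history = []      # h_t: the retained events only
    total = 0.0       # G_A
    for _ in range(num_valid):
        # Skip logged events until the algorithm agrees with the logged arm.
        while True:
            context, arm, payoff = next(events)
            if algorithm(history, context) == arm:
                break
        history.append((context, arm, payoff))
        total += payoff
    return total / num_valid

# Hypothetical logged stream and a policy that always picks arm 0.
logged = [(0, 1, 1.0), (0, 0, 0.0), (1, 1, 1.0), (1, 0, 1.0)]
estimate = policy_evaluator(lambda history, context: 0, logged, 2)
```

Events whose logged arm differs from the algorithm's choice are dropped without touching `history`, so the algorithm's state only ever reflects retained events, as in the pseudocode.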

Theorem 1. For all distributions $D$ of contexts and payoffs,
all algorithms $A$, all $T$, all sequences of events $h_T$, and
all streams $S$ containing i.i.d. events from a uniformly random
logging policy and $D$, we have

$$\Pr_{\text{Policy\_Evaluator}(A,S)}(h_T) = \Pr_{A,D}(h_T).$$

Furthermore, let $L$ be the number of events obtained from
the stream to gather the length-$T$ history $h_T$; then

1. the expected value of $L$ is $KT$, and

2. for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$L \le 2K(T + \ln(1/\delta))$.

This theorem says that every history $h_T$ has an identical
probability in the real world as in the policy evaluator. Any
statistic of these histories, such as the estimated per-trial
payoff $G_A/T$ returned by Algorithm 1, is therefore an unbiased
estimate of the respective quantity of the algorithm
$A$. Hence, by repeating Algorithm 1 multiple times and then
averaging the returned per-trial payoffs, we can accurately
estimate the per-trial payoff $g_A$ of any algorithm $A$ and
respective confidence intervals. Further, the theorem guarantees
that, with high probability, $O(KT)$ logged events are
sufficient to retain a sample of size $T$.

Proof. The first statement can be proved by mathematical
induction on the time steps of event streams [15]. Second,
since each event from the stream is retained with probability
exactly $1/K$, the expected number required to retain $T$
events is exactly $KT$. Finally, the high-probability bound
is an application of the multiplicative form of Chernoff's
inequality. □
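The two statements about $L$ can be checked numerically with a small simulation. The sketch below is an illustration under assumptions: the helper `events_needed` is hypothetical, and it models each logged event as matching the evaluated algorithm with probability $1/K$, which is what uniform logging implies regardless of the arm the algorithm picks.

```python
import math
import random

def events_needed(num_arms, num_valid, rng):
    """Count logged events read until num_valid of them are retained."""
    consumed = 0
    matched = 0
    while matched < num_valid:
        consumed += 1
        # A uniformly logged arm equals the algorithm's choice w.p. 1/K.
        if rng.randrange(num_arms) == 0:
            matched += 1
    return consumed

K, T, delta = 5, 20, 0.05
rng = random.Random(0)
samples = [events_needed(K, T, rng) for _ in range(2000)]

mean_L = sum(samples) / len(samples)            # should be close to K * T = 100
bound = 2 * K * (T + math.log(1 / delta))       # high-probability bound on L
frac_over = sum(L > bound for L in samples) / len(samples)
```

In this setup `mean_L` should sit near $KT$, and the fraction of runs exceeding the bound should stay below `delta`; the bound is typically far from tight.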

Given the unbiasedness guarantee, one may expect con- 
centration is also guaranteed; that is, the evaluator becomes 
more and more accurate as T increases. Unfortunately, such 
a conjecture is false for general bandit algorithms, as
explained in Example 3 of the next section.



Algorithm 2 Policy_Evaluator (with finite data stream).
Inputs: bandit algorithm A; stream of events S of length L
h_0 ← ∅ {An initially empty history}
G_A ← 0 {An initially zero total payoff}
T ← 0 {An initially zero counter of valid events}
for t = 1, 2, 3, ..., L do
    Get the t-th event (x, a, r_a) from S
    if A(h_{t-1}, x) = a then
        h_t ← CONCATENATE(h_{t-1}, (x, a, r_a))
        G_A ← G_A + r_a
        T ← T + 1
    else
        h_t ← h_{t-1}
    end if
end for
Output: G_A / T
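Algorithm 2 can be sketched in Python as below. As an assumption for illustration, a bandit algorithm is again a callable `algorithm(history, context)` and events are `(context, arm, payoff)` triples; the name `policy_evaluator_finite` is hypothetical, and raising an error when no event matches is a choice made here, since the pseudocode leaves the $T = 0$ case unspecified.

```python
def policy_evaluator_finite(algorithm, events):
    """Replay a finite log; return (estimated per-trial payoff, valid count)."""
    history = []      # retained events only
    total = 0.0       # G_A
    valid = 0         # T, a random quantity in this variant
    for context, arm, payoff in events:
        if algorithm(history, context) == arm:
            history.append((context, arm, payoff))
            total += payoff
            valid += 1
        # Otherwise the event is skipped and the history is unchanged.
    if valid == 0:
        raise ValueError("no logged event matched the algorithm's choices")
    return total / valid, valid

# Hypothetical log and a policy that always picks arm 0.
logged = [(0, 1, 1.0), (0, 0, 0.0), (1, 1, 1.0), (1, 0, 1.0)]
estimate, num_valid = policy_evaluator_finite(lambda history, context: 0, logged)
```

Returning the number of valid events alongside the estimate makes it easy to see how much of the log was actually usable for a given algorithm.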



3.2 Sample Complexity Result 

Next, we consider a situation that may be more relevant to
practical evaluation of a static policy when we have a finite
data set $S$ containing $L$ logged events. Roughly speaking,
the algorithm steps through every event in $S$ as in Algorithm 1
and obtains an estimate of the policy's average per-trial
payoff based on a random number of valid events. The
detailed pseudocode is in Algorithm 2.

Algorithm 2 is very similar to Algorithm 1. The only
difference is that the number of valid events, denoted $T$ in
the pseudocode, is a random number with mean $L/K$. For
this reason, the output of Algorithm 2 (namely, $G_A/T$) may
not be an unbiased estimate of the true per-trial payoff of $A$.
However, the next theorem shows that the final value of $T$
will be arbitrarily close to $L/K$ with high probability as long
as $L$ is large enough. Using this fact, the theorem further
shows that the returned value of Algorithm 2 is an accurate
estimate of the true per-trial payoff with high probability
when $A$ is a fixed policy that chooses action $a_t$ independent
of the history $h_{t-1}$. To emphasize that $A$ is a fixed policy,
the following theorem and its proof use $\pi$ instead of $A$.

Theorem 2. For all distributions $D$ over contexts and
payoffs, all policies $\pi$, all data streams $S$ containing $L$ i.i.d.
events drawn from a uniformly random logging policy and
$D$, and all $\delta \in (0, 1)$, we have, with probability at least $1 - \delta$,
that

$$\left|\frac{G_\pi}{T} - g_\pi\right| = O\left(\sqrt{\frac{K g_\pi}{L} \ln\frac{1}{\delta}}\right).$$

Therefore, for any $g \ge g_\pi$, with high probability, the theorem
guarantees that the returned value $G_\pi/T$ is a close
estimate of the true value $g_\pi$ with error on the order of
$\tilde{O}\left(\sqrt{Kg/L}\right)$. As $L$ increases, the error decreases to 0 at
the rate of $O(1/\sqrt{L})$. This error bound improves a previous
result [17, Theorem 5] for a similar offline evaluation
algorithm and similarly provides a sharpened analysis for
the $T = 1$ special case for policy evaluation in reinforcement
learning [15]. Section 4 provides empirical evidence
matching our bound.
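The $1/\sqrt{L}$ behavior for a fixed policy can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the paper's experiment: the helper names, the 4-arm setup, the click probability of 0.3, and the context-free policy that always picks arm 0 are all invented for the example.

```python
import random

def replay_estimate(num_events, num_arms, click_prob, rng):
    """Finite-stream replay estimate for a fixed policy that always picks arm 0."""
    clicks = 0
    valid = 0
    for _ in range(num_events):
        logged_arm = rng.randrange(num_arms)     # uniformly random logging
        if logged_arm == 0:                       # event matches the policy
            valid += 1
            clicks += rng.random() < click_prob   # Bernoulli payoff
    return clicks / valid

def rms_error(num_events, repeats, rng, num_arms=4, click_prob=0.3):
    """Root-mean-square error of the estimate over independent synthetic logs."""
    squared = 0.0
    for _ in range(repeats):
        estimate = replay_estimate(num_events, num_arms, click_prob, rng)
        squared += (estimate - click_prob) ** 2
    return (squared / repeats) ** 0.5

rng = random.Random(0)
err_small = rms_error(500, 200, rng)     # L = 500
err_large = rms_error(2000, 200, rng)    # L = 2000, four times as many events
```

Quadrupling $L$ should roughly halve the error in this toy setting, which is the scaling the theorem predicts for history-independent policies.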

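Proof. The proof involves a couple of applications of the
multiplicative Chernoff/Hoeffding bound [20, Corollary 5.2].
To simplify notation, we use $\Pr(\cdot)$ and $\mathbb{E}[\cdot]$ in the proof to
denote the probability and expectation with respect to randomness
generated by $\pi$ and $S$. Let $(x_t, a_t, r_{t,a_t})$ be the t-th
event in the stream $S$, and $V_t$ be the (random) indicator that
$a_t$ matches the arm chosen by policy $\pi$ in the context $x_t$.
Then, $T = \sum_{t=1}^{L} V_t$, $G_\pi = \sum_{t=1}^{L} V_t r_{t,a_t}$, and the returned
value of Algorithm 2 is $G_\pi/T$. We bound the denominator
and numerator, respectively.

First, since $a_t$ is chosen uniformly at random, we have
$\mathbb{E}[V_t] = 1/K$ for all $t$ and thus $\mathbb{E}\left[\sum_{t=1}^{L} V_t\right] = L/K$. Using
the multiplicative form of Chernoff's bound, we have

$$\Pr\left(\left|T - \frac{L}{K}\right| > \frac{\gamma_1 L}{K}\right) \le 2\exp\left(-\frac{L\gamma_1^2}{3K}\right) \qquad (2)$$

for any $\gamma_1 > 0$. Let the right-hand side above be $\delta/2$ and
solve for $\gamma_1$:

$$\gamma_1 = \sqrt{\frac{3K}{L}\ln\frac{4}{\delta}}.$$

Similarly, since $a_t$ is uniformly chosen, we have $\mathbb{E}[G_\pi] =
L g_\pi / K$. Applying the multiplicative Chernoff bound again,
we have for any $\gamma_2 > 0$ that

$$\Pr\left(\left|G_\pi - \frac{L g_\pi}{K}\right| > \frac{\gamma_2 L g_\pi}{K}\right) \le 2\exp\left(-\frac{L g_\pi \gamma_2^2}{3K}\right). \qquad (3)$$

Let the right-hand side above be $\delta/2$ and solve for $\gamma_2$:

$$\gamma_2 = \sqrt{\frac{3K}{L g_\pi}\ln\frac{4}{\delta}}.$$

Now applying a union bound over the probabilistic statements
in Equations (2) and (3), we can see that, with probability
at least $1 - \delta$, the following holds:

$$\frac{1 - \gamma_1}{K} \le \frac{T}{L} \le \frac{1 + \gamma_1}{K},$$

$$\frac{g_\pi(1 - \gamma_2)}{K} \le \frac{G_\pi}{L} \le \frac{g_\pi(1 + \gamma_2)}{K}.$$

These two inequalities together imply

$$\left|\frac{G_\pi}{T} - g_\pi\right| \le \frac{(\gamma_1 + \gamma_2)\, g_\pi}{1 - \gamma_1} = O\left(\sqrt{\frac{K g_\pi}{L}\ln\frac{1}{\delta}}\right),$$

which finishes the proof. □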
Given Theorem 2, one might wonder if a similar result
holds for general bandit algorithms. Unfortunately, the
following example shows that such a concentration result is
impossible in general.

Example 3. Consider a contextual bandit problem with
$K = 2$ and $x \in \{0, 1\}$ in which $r_{t,1} = 1$ and $r_{t,2} = 0$ for
all $t = 1, 2, \ldots$. Suppose $x$ is defined by a uniform random
coin flip. Let $A$ be an algorithm that operates as follows: if
$x_1 = 1$ the algorithm chooses $a_t = 1$ for all $t$; otherwise,
it always chooses $a_t = 2$. Therefore, the expected per-trial
payoff of $A$ is $g_A = 0.5$. However, in any individual run of
the algorithm, its T-step total reward $G_A$ is either $T$ (if $A$
always chooses $a_t = 1$) or $0$ (if $A$ always chooses $a_t = 2$),
and therefore, $|G_A/T - g_A| = 0.5$ no matter how large $T$ is.
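Example 3 is easy to reproduce in simulation. The sketch below runs the history-dependent algorithm directly against the synthetic world of the example; the function name `example3_run` and the run counts are assumptions chosen for illustration.

```python
import random

def example3_run(num_trials, rng):
    """Per-trial payoff of one run of the algorithm from Example 3."""
    first_context = None
    total = 0
    for _ in range(num_trials):
        context = rng.randint(0, 1)          # uniform coin flip
        if first_context is None:
            first_context = context          # the algorithm remembers x_1 only
        arm = 1 if first_context == 1 else 2
        total += 1 if arm == 1 else 0        # r_{t,1} = 1, r_{t,2} = 0
    return total / num_trials

rng = random.Random(0)
runs = [example3_run(50, rng) for _ in range(200)]
average = sum(runs) / len(runs)
```

Every individual run returns exactly 0 or 1, so it is always 0.5 away from $g_A$, while the average over many independent runs approaches 0.5. This is why repeating the evaluation and averaging still helps even when a single run does not concentrate.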

This counterexample shows that an exponential tail style
deviation bound does not hold for general bandit algorithms
that are dependent on history. Not all hope is lost though:
there are some known algorithms for which deviation bounds
are provable; for example, the epoch-greedy algorithm [18],
UCB1 [5], and EXP3.P. Furthermore, as commented earlier,
we can always repeat the evaluation process multiple
times and then average the outcomes to get an accurate
estimate of the algorithm's performance. In the next section,
we show empirically that Algorithm 1 returns highly stable
results for all algorithms we have tried.

Figure 1: A snapshot of the "Featured" tab in the
Today Module on the Yahoo! Front Page [19]. By
default, the article at F1 position is highlighted at
the story position.

4. CASE STUDY 

In this section, we apply the offline evaluation method in
the previous section to a large-scale, real-world problem with
variable arm sets to validate the effectiveness of our offline
evaluation methodology. Specifically, we provide empirical
evidence for: (i) the unbiasedness guarantee in Theorem 1,
(ii) the convergence rate in Theorem 2, (iii) the low variance
of the evaluation result, and (iv) the effectiveness of the
evaluation method when the arm set may change over time.

While the proposed evaluation methodology has been applied
to the same application [19], our focus here is on the
effectiveness of the offline evaluation method itself. More 
importantly, we also provide empirical evidence of unbiased- 
ness for not only fixed policies but also learning algorithms, 
by relating offline evaluation metric to online performance 
in large-scale production buckets on Yahoo! Front Page. 

We will first describe the application and show how it can 
be modeled as a contextual bandit problem. Second, we 
compare the offline evaluation result of a policy to its on- 
line evaluation to show our evaluation approach is indeed 
unbiased and it gives results that are asymptotically con- 
sistent when the number of valid events (the quantity T in 
Algorithms 1 and 2) is large. Third, we provide empirical
evidence that our offline evaluation method gives very stable
results for a few representative algorithms. Finally, we
study the relationship between offline evaluation results and
online bucket performance for three bandit algorithms.

4.1 News Article Recommendation on Yahoo! 
Front Page Today Module 

The Today Module is the most prominent panel on the
Yahoo! Front Page, which is also one of the most visited
pages on the Internet; see a snapshot in Figure 1. The default
"Featured" tab in the Today Module highlights four
high-quality news articles, selected from an hourly-refreshed
article pool maintained by human editors. As illustrated in
Figure 1, there are four articles at footer positions, indexed
by F1-F4. Each article is represented by a small picture
and a title. One of the four articles is highlighted at the
story position, which is featured by a large picture, a title
and a short summary along with related links. By default,
the article at F1 is highlighted at the story position. A user
can click on the highlighted article at the story position to
read more details if interested in the article. The event is
recorded as a story click. To draw visitors' attention, we
would like to rank available articles according to individual
interests, and highlight the most attractive article for each
visitor at the story position. In this paper, we focus on
selecting articles for the story position.

This problem can be naturally modeled as a contextual
bandit problem. Here, it is reasonable to assume user
visits and their click probabilities on articles to be
(approximately) i.i.d. Furthermore, each user has a set of features
(such as age, gender, etc.) from which the click probability
of a specific article may be inferred; these features are the
contextual information used in the bandit process. Finally,
we may view articles in the pool as arms, and the payoff
is 1 if the user clicks on the article and 0 otherwise. With
this definition of payoff, the expected payoff of an article is
precisely its CTR, and choosing an article with maximum
CTR is equivalent to maximizing the expected number of
clicks from users, which in turn is the same as maximizing
the per-trial payoff in our bandit formulation.

We set up cookie-based buckets for evaluation. A bucket
consists of a certain number of visitors. A cookie is a string
of 13 letters randomly generated by the web browser as 
an identifier. We can specify a cookie pattern to create a 
bucket. For example, we could let users with the starting 
letter "a" in their cookies fall in one bucket. In a cookie- 
based bucket, a user is served by the same policy, unless the 
user changes the cookie and then belongs to another bucket. 

For offline evaluation, millions of events were collected 
from a "random bucket" from Nov. 1, 2009 to Nov. 10, 2009. 
In the random bucket, articles are randomly selected from 
the article pool to serve users. There are about 40 million 
events in the offline evaluation data set, and about 20 articles 
available in the pool at every moment. 

We focused on user interactions with the story article at 
the story position only. The user interactions are recorded 
as two types of events, user visit event and story click event. 
We chose CTR as the metric of interest, which is defined as 
the ratio between the number of story click events and the 
number of user visits. To protect business-sensitive infor- 
mation, we only report relative CTRs which are defined as 
the ratio between true CTRs and a hidden constant. 

4.2 Unbiasedness Analysis 

Given a policy, the unbiasedness of the offline evaluation 
methodology can be empirically verified by comparing offline 
metrics with online performance. We set up another cookie- 
based bucket, noted as "serving bucket", to evaluate online 
performance. In the serving bucket, a spatio-temporal al- 
gorithm [3] was deployed to estimate article CTRs0 The 
article with the highest CTR estimate (also known as the 

^Note that the CTR estimates are updated every 5 minutes. 



winner article) was then used to serve users. We extracted 
the serving policy from the "serving bucket", i.e., the best 
article in every 5-minute interval from Nov. 1, 2009 to Nov. 10, 2009.
Note that this is the same time period as that of the offline
evaluation data set, ensuring that the sets of available arms are
the same in both the serving and random buckets. Then,
we used Algorithm [T] to evaluate the serving policy on the 
events from the random bucket for the offline metric. 
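The evaluation algorithm itself is not reproduced in this section, so the following is only a minimal sketch of replay-style evaluation for a fixed, time-indexed policy such as the extracted serving policy. It assumes that each logged event from the random bucket is a `(timestamp, displayed_article, click)` tuple and that the policy is a function from a timestamp to the winner article; all names are hypothetical.

```python
from typing import Callable, Hashable, Iterable, Tuple

Event = Tuple[float, Hashable, float]  # (timestamp in seconds, displayed article, click)


def replay_static_policy(events: Iterable[Event],
                         policy: Callable[[float], Hashable]) -> Tuple[float, int]:
    """Estimate a policy's CTR from randomly served logs.

    Only events whose logged article equals the policy's choice are kept;
    the estimate is the average click over those matched events.
    """
    matched = 0
    clicks = 0.0
    for timestamp, displayed, click in events:
        if policy(timestamp) == displayed:
            matched += 1
            clicks += click
    if matched == 0:
        raise ValueError("no logged event matches the policy's choices")
    return clicks / matched, matched


# Toy data: the winner is "A" in the first 5-minute slot and "B" in the second.
schedule = {0: "A", 1: "B"}
winner_at = lambda ts: schedule[int(ts // 300)]
log = [(0, "A", 1), (10, "B", 0), (20, "A", 0), (310, "B", 1), (320, "A", 1)]
estimate, used = replay_static_policy(log, winner_at)
```

In this toy log, three of the five events match the policy, and two of those three were clicked.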

It should be noted that the outcome of our experiments
is not a foregone conclusion of the mathematics presented,
because the setting differs in some ways from the i.i.d.
assumption made in our theorems, as is typical in real-world
applications. In particular, events are not exchangeable since
old articles leave the system and new ones enter, there are 
sometimes unlogged business rule constraints on the serv- 
ing policy, and users of course do not behave independently 
when they repeatedly visit the same site. We finesse away 
this last issue, but the first two are still valid. 

In the serving bucket, a winner article usually remains the 
best for a while. During its winning time, the user repeat- 
edly sees the same article. At the same time, the users in 
the random bucket are very likely to see different articles 
at user visit events, due to the random serving policy. It 
is conceivable that the more a user views the same article, 
the less likely the user is to click on the article. This conditional
effect violates the i.i.d. assumption in Theorem [1]. Fortunately,
the discrepancy can be removed by considering CTR
on distinct views. For each user, consecutive events of viewing
the same article are counted as one user visit only. The
CTR on distinct views in the serving bucket measures user
interactions with the winner articles across the whole session.
Regarding the offline evaluation metric as in Algorithm [2],
the subset of events sampled in the random bucket also measures
user interactions with the winner articles across the
whole session.
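One possible way to compute CTR on distinct views is sketched below. It assumes events arrive in time order as `(user, article, click)` tuples, and it counts a run of consecutive views as clicked if any event in the run was clicked; that last choice is an assumption of this sketch, since the text does not spell it out.

```python
def distinct_view_ctr(events):
    """CTR where consecutive views of the same article by one user count once."""
    current_run = {}  # user -> [article of the current run, run already clicked?]
    visits = 0
    clicks = 0
    for user, article, click in events:
        run = current_run.get(user)
        if run is None or run[0] != article:
            # A different article (or a new user) starts a new distinct view.
            run = [article, False]
            current_run[user] = run
            visits += 1
        if click and not run[1]:
            run[1] = True
            clicks += 1
    return clicks / visits if visits else 0.0


toy_events = [
    ("u1", "A", 0), ("u1", "A", 0), ("u1", "A", 1),  # one distinct view, clicked
    ("u2", "A", 0),                                   # one distinct view
    ("u1", "B", 0),                                   # one distinct view
    ("u1", "A", 0),                                   # "A" again after "B": a new view
]
toy_ctr = distinct_view_ctr(toy_events)
```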

We first compared online and offline per-article CTRs.
Only winner articles that were viewed more than 20,000
times in the serving bucket are used in the plot so that their
online CTRs are accurate enough to be treated as ground
truth. Figure [2] shows that the CTRs evaluated offline
are very close to the CTRs estimated online.

We next compared online and offline CTRs at the policy 
level. These CTRs are the overall CTR of the serving policy 
aggregated over all articles. Figure [3] shows that the two CTRs
are very close on each individual day. 

Both sets of results corroborate the unbiasedness guarantee
of Theorem [1], a property of particular importance in
practice that is almost impossible to obtain with simulator-based
evaluation methods. Therefore, our evaluation method provides
a solution that is accurate (like bucket tests) without the 
cost and risk of running the policy in the real system. 

4.3 Convergence Rate Analysis 

We now study how the difference between offline and online
CTRs decreases with more data (namely, the quantity T
in the evaluation methods). To show the convergence rate,
we present the estimated error versus the number of samples
used in offline evaluation. Formally, we define the estimated
error by $e = |c - \hat{c}|$, where $c$ and $\hat{c}$ are the true CTR and
estimated CTR, respectively.
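To make the error measure concrete, the sketch below simulates Bernoulli clicks with a known CTR and reports $e = |c - \hat{c}|$ next to a $1/\sqrt{T}$ reference value for several sample sizes. The simulated data, the seed, and the function name are assumptions for illustration; this does not reproduce the paper's experiment.

```python
import math
import random


def estimate_errors(true_ctr, sample_sizes, seed=0):
    """Return (T, |c - c_hat|, 1/sqrt(T)) for each sample size T."""
    rng = random.Random(seed)
    rows = []
    for T in sample_sizes:
        # Simulate T independent visits, each clicked with probability true_ctr.
        clicks = sum(rng.random() < true_ctr for _ in range(T))
        c_hat = clicks / T
        rows.append((T, abs(true_ctr - c_hat), 1.0 / math.sqrt(T)))
    return rows


for T, err, ref in estimate_errors(0.05, [100, 10_000, 100_000]):
    print(f"T={T:>7}  error={err:.5f}  1/sqrt(T)={ref:.5f}")
```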

Figures [4] and [5] present the convergence rate of the CTR
estimate error for various articles and the online serving
policy, respectively, and the red curve is $1/\sqrt{T}$, the functional
form of the upper confidence bound. These results suggest
that, in practice, we can observe the error decay rate
predicted by Theorem [2] for reasonably stable algorithms such
as those evaluated.

Figure 2: Articles' CTRs in the online bucket versus
offline estimates.

Figure 3: Daily overall CTRs in the online bucket
versus offline estimates.

4.4 Low Variance of Evaluation Results 

In this subsection, we chose three representative algorithms
(cf. Appendix) to illustrate the low variance of the
offline evaluation technique:

• ε-greedy, a stochastic, context-free algorithm;

• UCB, a deterministic, context-free variant of UCB1 [5];

• LinUCB [19], a deterministic, contextual bandit algo- 
rithm that uses ridge regression to estimate arm pay- 
offs based on contexts. 
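The arm-selection step of the two context-free algorithms in this list can be sketched as follows. This is a hedged illustration rather than the exact variants evaluated: the exploration bonus uses the standard UCB1 form $\sqrt{2 \ln t / n_a}$ scaled by a parameter `alpha`, which is an assumption here, and LinUCB is omitted because it needs the ridge-regression machinery.

```python
import math
import random


def epsilon_greedy_arm(mean_estimates, epsilon, rng):
    """With probability epsilon pick a uniformly random arm, else the greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(mean_estimates))
    return max(range(len(mean_estimates)), key=lambda a: mean_estimates[a])


def ucb_arm(mean_estimates, pull_counts, t, alpha):
    """Pick the arm with the largest estimate plus alpha times a confidence width."""
    def score(a):
        if pull_counts[a] == 0:
            return math.inf  # untried arms are explored first
        return mean_estimates[a] + alpha * math.sqrt(2.0 * math.log(t) / pull_counts[a])
    return max(range(len(mean_estimates)), key=score)


rng = random.Random(0)
arm_eps = epsilon_greedy_arm([0.1, 0.5, 0.3], epsilon=0.4, rng=rng)
arm_ucb = ucb_arm([0.50, 0.45], pull_counts=[1000, 10], t=1010, alpha=1.0)
```

In the UCB call, the second arm has a slightly lower estimate but far fewer pulls, so its wider confidence interval makes it the selected arm.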

Each of the algorithms above has one parameter: $\epsilon$ for
ε-greedy and $\alpha$ for UCB and LinUCB (see [19] for details). We
fixed the parameters to reasonable values: $\epsilon = 0.4$ and $\alpha = 1$.
We collected over 4,000,000 user visits from a random
bucket on May 1, 2009. To evaluate variance, we subsampled
this data so that each event is used with probability 0.5. We
ran each algorithm 100 times on independently subsampled
events and measured the returned CTR using Algorithm [2].
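The subsampling protocol can be sketched as below. The `evaluate` argument stands for any routine that runs one algorithm through the offline evaluator on a list of events and returns a CTR estimate; it is a hypothetical callable supplied by the reader, and the use of the sample standard deviation is also an assumption.

```python
import random
import statistics


def subsample(events, keep_prob, rng):
    """Keep each event independently with probability keep_prob."""
    return [e for e in events if rng.random() < keep_prob]


def variance_study(events, evaluate, runs=100, keep_prob=0.5, seed=0):
    """Evaluate on independently subsampled data and summarize the estimates."""
    rng = random.Random(seed)
    estimates = [evaluate(subsample(events, keep_prob, rng)) for _ in range(runs)]
    return {
        "mean": statistics.mean(estimates),
        "std": statistics.stdev(estimates),
        "max": max(estimates),
        "min": min(estimates),
    }


# Toy stand-in: events are click indicators and "evaluation" is the average click.
toy_events = [1] * 300 + [0] * 700
toy_stats = variance_study(toy_events, lambda ev: sum(ev) / len(ev), runs=20)
```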

Table [1] summarizes statistics of CTR estimates for the
three algorithms. It shows that the evaluation results are
highly consistent across different random runs. Specifically,
the ratio between the standard deviation and the mean CTR
is about 2.4% for ε-greedy, and below 1.5% for UCB
and LinUCB, which have known algorithm-specific deviation
bounds.
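For reference, the ratios quoted here follow directly from the std and mean values reported in Table 1:

```latex
\frac{0.0308}{1.2664} \approx 2.4\% \ (\epsilon\text{-greedy}), \qquad
\frac{0.0192}{1.3278} \approx 1.4\% \ (\text{UCB}), \qquad
\frac{0.0157}{1.3867} \approx 1.1\% \ (\text{LinUCB}).
```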

This experiment demonstrates empirically that our evaluation
method can give results that have small variance for a
few natural algorithms, despite the artificial counterexample
in Section [3], suggesting that with large datasets the results
obtained from only one run of our evaluation method are
already quite reliable.

^In the terminology of [19], the CTR estimates reported
in Table [1] are for the "learning bucket". Similar standard
deviations are found for the so-called "deployment bucket".



algorithm    mean      std       max       min
ε-greedy     1.2664    0.0308    1.3079    1.1671
UCB          1.3278    0.0192    1.3661    1.2812
LinUCB       1.3867    0.0157    1.4268    1.3491

Table 1: Statistics of CTR estimates for three representative
algorithms using Algorithm [2].

4.5 Consistency with Online Performance 

Sections [4.2] and [4.3] give evidence for the accuracy of
the offline evaluation method when applied to a static policy
that is fixed over time. In this section, we show complementary
accuracy results for learning algorithms that may be
viewed as history-dependent, non-fixed policies. In particular,
we show the consistency between our offline evaluation
and online evaluation of three ε-greedy bandit models:

• Estimated Most Popular (EMP): we estimate the CTR of
available articles over all users via a random exploration
bucket, and then serve users in the EMP bucket
the article with the highest CTR;

• Segmented Most Popular (SEMP): we segment users into
18 clusters based on their age/gender information. We
estimate the CTR of available articles within each cluster,
and for each user cluster serve the article with the
highest CTR. Note that users' feedback may change
the serving policy for the cluster in future trials;

• Contextual Bandit Model (CEMP): this is a fine-grained
personalized model. In this model, we define a separate
context for each user based on her age, gender,
etc. For each available article, we maintain a logistic
regression model to predict its CTR given the user
context. When a user arrives in the CEMP bucket, we
estimate the CTRs of all articles for the user and select
the article with the highest estimated CTR to display.
Users with different contexts may be served
different articles in this bucket, while each user's feedback
will affect other users' click probability estimation on
this article in future trials.

Figure 4: Decay rate of error in articles' CTR estimates
with increasing data size. The red curve plots
the function $1/\sqrt{x}$.

Figure 5: Decay rate of error in overall CTR estimates
with increasing data size. The red curve plots
the function $1/\sqrt{x}$.
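As a rough illustration of the EMP and SEMP models described in this section, the sketch below keeps per-segment view and click counts and serves the article with the highest estimated CTR. The class and method names are hypothetical; EMP corresponds to using a single segment for all users, and the logistic-regression CEMP model is not shown.

```python
from collections import defaultdict


class SegmentedMostPopular:
    """Per-segment CTR estimates updated from random-exploration events."""

    def __init__(self):
        self.views = defaultdict(int)   # (segment, article) -> number of views
        self.clicks = defaultdict(int)  # (segment, article) -> number of clicks

    def update(self, segment, article, click):
        self.views[(segment, article)] += 1
        self.clicks[(segment, article)] += int(click)

    def estimate(self, segment, article):
        views = self.views[(segment, article)]
        return self.clicks[(segment, article)] / views if views else 0.0

    def choose(self, segment, pool):
        """Serve the article in the pool with the highest estimated CTR."""
        return max(pool, key=lambda article: self.estimate(segment, article))


model = SegmentedMostPopular()
for segment, article, click in [("s1", "A", 1), ("s1", "B", 0),
                                ("s2", "A", 0), ("s2", "B", 1)]:
    model.update(segment, article, click)
```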

We set up three online bcookie-based buckets to deploy the
three bandit models, one bucket per model. We also set up
another bucket to collect random exploration data. This random
data is used to update the states of the three online bandit
models and also for our offline evaluation. For a given period,
we obtain the online per-trial payoffs $g_A^{\mathrm{online}}$ for
$A \in \{\text{EMP}, \text{SEMP}, \text{CEMP}\}$. Using data in the
random exploration bucket, we run our offline evaluation for
these three models in the same period and get the per-trial
payoffs $\hat{g}_A^{\mathrm{offline}}$.

It is important to note that there were unlogged business-rule
constraints in all online serving buckets of Today Module;
for instance, an article may be forced to be shown in a given
time window. Fortunately, our data analysis (not reported
here) suggested that such business rules have roughly the
same multiplicative impact on each algorithm's online CTR,
although this multiplicative factor may vary across different
days. To remove effects caused by business rules, we report
the ratio of the offline CTR estimate to the online CTR for each
model: $\rho_A = \hat{g}_A^{\mathrm{offline}} / g_A^{\mathrm{online}}$. If our offline evaluation metric
is faithful to an algorithm's online metric in the absence of
business rules, then it is expected that, for a given period
of time such as one day, $\rho_A$ should ideally remain constant and
not depend on the algorithm $A$.
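The ratio defined in this paragraph can be computed per day and per model as sketched below. The dictionary layout and the numbers are invented for illustration; they are not measurements from the paper.

```python
def payoff_ratios(offline_ctr, online_ctr):
    """Ratio of offline CTR estimate to online CTR for each model."""
    return {model: offline_ctr[model] / online_ctr[model] for model in online_ctr}


# Invented one-day example in which a business rule scales every online CTR
# by the same factor, so the ratio is identical across models.
offline_day = {"EMP": 1.2, "SEMP": 1.5}
online_day = {"EMP": 1.0, "SEMP": 1.25}
ratios = payoff_ratios(offline_day, online_day)
```

When the multiplicative effect is shared by all models on a given day, plotting one model's ratio against another's across days should give points near a line of slope 1.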

In Figure [6], we present a scatter plot of $\rho_{\text{EMP}}$ vs. $\rho_{\text{SEMP}}$ over
16 days, from May 3, 2009 to May 18, 2009. On each day,
we have about 2,000,000 views (i.e., user visits) in each of
the four online buckets. The scatter plot indicates a strong
linear correlation. The slope of the least-squares linear regression
is 1.019 and the standard deviation of the residual vector
is 0.0563. We observed that business rules have almost the
same impact on CTR in the buckets for the two serving policies.

SEMP is a relatively simple bandit algorithm similar to EMP.
In the next experiment, we study the online/offline correlation
of a more complicated contextual bandit serving policy,
CEMP, in which CTRs are estimated using logistic regression
on user features and a separate logistic regression model is
maintained for each article. Figure [7] shows the scatter plot
of $\rho_{\text{EMP}}$ vs. $\rho_{\text{CEMP}}$ in 18-day data from May 22, 2010 to June
8, 2010. On each day, we have 2,000,000 to 3,000,000 views
in each online bucket. The scatter plot again indicates a
strong linear correlation. In this comparison, the slope and
the standard deviation of the residual vector are 1.113 and 0.075,
respectively. This shows that the difference between our offline
and online evaluation, caused by business rules and other
systemic factors (e.g., time-outs in user feature retrieval and
delays in model updates), is comparable across bandit models.
Although the daily factor is unpredictable, the relative performance
of bandit models in offline evaluation is preserved
in online buckets. Thus, our offline evaluator can provide
reliable comparison of different models on historical data,
even in the presence of business rules.

5. CONCLUSIONS AND FUTURE WORK 

This paper studies an offline evaluation method of ban- 
dit algorithms that relies on log data directly rather than 
on a simulator. The only requirement of this method is 
that the log data is generated i.i.d. with arms chosen by an 
(ideally uniformly) random policy. We show that the eval- 
uation method gives unbiased estimates of quantities like 
total payoffs, and also provide a sample complexity bound 
for the estimated error when the algorithm is a fixed pol- 
icy. The evaluation method is empirically validated using 
real-world data collected from Yahoo! Front Page for the 
challenging application of online news article recommenda- 
tion. Empirical results verify our theoretical guarantees, and 
demonstrate both accuracy and stability of our method us- 
ing real online bucket results. These encouraging results 
suggest the usefulness of our evaluation method, which can 
be easily applied to other related applications such as online 
refinement of ranking results [21] and ads display. 

Figure 6: Scatter plot of ratios of offline metric and
online bucket performance over 16 days in 2009.

Figure 7: Scatter plot of ratios of offline metric and
online bucket performance over 18 days in 2010.

Our evaluation method, however, ignores a $(K-1)/K$ fraction
of the logged data. Therefore, it does not make use of all
data, which can be a problem when $K$ is large or when
data is expensive to obtain. Furthermore, in some risk-sensitive
applications, while we can inject some randomness
during data collection, a uniformly random policy might be
too much to hope for due to practical constraints (such as
user satisfaction). As we mentioned earlier, our evaluation
method may be extended to work for data collected by any
random policy with rejection sampling, which enjoys similar
unbiasedness guarantees, but reduces the data efficiency at
the same time. An interesting future direction, therefore, is
exploiting problem-specific structures to avoid exploration
of the full arm space. A related question is how to make use
of non-random data for reliable offline evaluation, for which
recent progress has been made [24].
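One way the rejection-sampling extension could look is sketched below, under the assumption that the logging policy's selection probabilities are known and strictly positive for every arm. An event is kept with probability equal to the smallest logging probability divided by the probability of the logged arm, so that the kept events are uniformly distributed over arms. The event layout and function name are hypothetical.

```python
import random


def rejection_sample_to_uniform(events, rng):
    """Thin non-uniformly logged events so that kept arms are uniform.

    Each event is (context, arm, reward, propensities), where propensities
    maps every available arm to its (positive) logging probability.
    """
    kept = []
    for context, arm, reward, propensities in events:
        accept_prob = min(propensities.values()) / propensities[arm]
        if rng.random() < accept_prob:
            kept.append((context, arm, reward))
    return kept


rng = random.Random(0)
uniform_log = [(None, "A", 1, {"A": 0.5, "B": 0.5})] * 3
rare_arm_log = [(None, "B", 0, {"A": 0.9, "B": 0.1})] * 4
kept_uniform = rejection_sample_to_uniform(uniform_log, rng)
kept_rare = rejection_sample_to_uniform(rare_arm_log, rng)
```

Events for the least likely arm are always kept, while events for frequently logged arms are mostly discarded, which is the loss of data efficiency noted in the text.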
APPENDIX

One of the simplest and most widely used algorithms is
ε-greedy [26]. In each trial $t$, the algorithm first estimates
the average payoff $\hat{\mu}_{t,a}$ of each arm $a$. Then, with
probability $1 - \epsilon$, it chooses the greedy arm that has the
highest payoff estimate: $a_t = \arg\max_a \hat{\mu}_{t,a}$; with
probability $\epsilon$, it chooses a random arm. Clearly, each arm
will be tried infinitely often in the limit, and so the payoff
estimate $\hat{\mu}_{t,a}$ converges to the true value $\mu_a$ with
probability 1 as $t \to \infty$. Furthermore, by decaying $\epsilon$
appropriately, the per-step regret, $R_A(T)/T$, converges to 0
with probability 1 [23].

The ε-greedy strategy is unguided since it picks a random
arm for exploration. Intuitively, when an arm is clearly
suboptimal, it need not be explored. In contrast, another
class of algorithms generally known as "upper confidence
bound" algorithms [16, 5] use a smarter way to balance
exploration and exploitation. In particular, in trial $t$, these
algorithms estimate both the mean payoff $\hat{\mu}_{t,a}$ of each arm
$a$ and a corresponding confidence interval $c_{t,a}$, so that
$|\hat{\mu}_{t,a} - \mu_a| < c_{t,a}$ holds with high probability. They then
select the arm that achieves the highest upper confidence bound
(UCB for short): $a_t = \arg\max_a (\hat{\mu}_{t,a} + \alpha c_{t,a})$, where $\alpha$ is
a tunable parameter that may increase slowly over time.
In other words, UCB algorithms choose an arm that either
has a high payoff estimate, or a high estimation uncertainty
measure (corresponding to large values of $\alpha c_{t,a}$). As more
data are collected to refine the payoff estimate, the
confidence interval vanishes, and the algorithms will behave
more greedily. With appropriately defined confidence intervals
and parameter $\alpha$, it can be shown that such algorithms
have a small total $T$-trial regret that is only logarithmic in
the total number of trials $T$ [16, 5].

While context-free K-armed bandits are extensively studied
and well understood, the more general contextual bandit
problem has largely remained open. The EXP4 algorithm
and its variants [51 H] use the exponential weighting technique
to achieve an $\tilde{O}(\sqrt{T})$ regret in expectation, where
$\tilde{O}(x) = O(x \ln x)$, even if the sequence of contexts and payoffs
is chosen by an adversarial world, but the computational
complexity may be exponential in the number of features
in general. Another general contextual bandit algorithm is
the epoch-greedy algorithm [18] that is similar to ε-greedy
with adaptively shrinking $\epsilon$. Assuming the sequence of
contexts, $x_1, \ldots, x_t$, is i.i.d., this algorithm is computationally
efficient given an oracle empirical risk minimizer but has the
weaker regret guarantee of $\tilde{O}(T^{2/3})$ in general, with stronger
guarantees in various special cases.

Algorithms with stronger regret guarantees may be designed
under various modeling assumptions about the contextual
bandit. Assuming the expected payoff of an arm is
linear in its features (namely, $\mathbf{E}_D[r_{t,a} \mid x_{t,a}] = w^\top x_{t,a}$ for
some coefficient vector $w$), both LinRel [4] and LinUCB
[19, 10, 11] are essentially UCB-type approaches generalized to
linear payoff functions, and their variants have a regret of
$\tilde{O}(\sqrt{T})$, a significant improvement over earlier algorithms [1]
as well as the more general epoch-greedy algorithm. Extensions
to generalized linear models [12] are also possible and
can still enjoy the same $\tilde{O}(\sqrt{T})$ regret guarantee.



